
[CASCL-1304] Snapshot node managers in ConfigMap during install #2945

Merged
L3n41c merged 4 commits into main from lenaic/CASCL-1304-list-node-manager
Apr 28, 2026

Conversation

@L3n41c
Member

@L3n41c L3n41c commented Apr 27, 2026

What does this PR do?

At the very end of kubectl datadog autoscaling cluster install, classify every node by its current management method (Fargate / Karpenter / EKS managed node group / EC2 ASG / standalone / unknown) and persist the snapshot as YAML in a dd-cluster-info ConfigMap in the Karpenter namespace. The command also detects whether a legacy cluster-autoscaler Deployment is running (and where) and records it as a sibling field in the same ConfigMap.

The classification follows the decision tree documented in the Confluence page EKS Node Management Identification Guide — Signals, Tags and Datadog Visibility.

ConfigMap shape:

apiVersion: v1
clusterName: ...
generatedAt: ...
nodeManagement:
  fargate:
    <fargate-profile>: [<node-name>...]
  karpenter:
    <nodepool>: [...]
  eksManagedNodeGroup:
    <nodegroup>: [...]
  asg:
    <asg-name>: [...]
  standalone:
    "": [...]
  unknown:
    "<providerID>": [...]
clusterAutoscaler:
  present: true
  namespace: kube-system
  name: cluster-autoscaler
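
For illustration, a minimal sketch of the per-node classification behind the nodeManagement buckets above, assuming the standard EKS and Karpenter node labels described in the guide; the function name, the NodeManager constants, and the asgByInstanceID map are illustrative and not necessarily the actual implementation:

package clusterinfo

import (
    "strings"

    corev1 "k8s.io/api/core/v1"
)

// NodeManager is one of the buckets shown in the ConfigMap shape above.
type NodeManager string

const (
    ManagerFargate             NodeManager = "fargate"
    ManagerKarpenter           NodeManager = "karpenter"
    ManagerEKSManagedNodeGroup NodeManager = "eksManagedNodeGroup"
    ManagerASG                 NodeManager = "asg"
    ManagerStandalone          NodeManager = "standalone"
    ManagerUnknown             NodeManager = "unknown"
)

// classifyNode walks the decision tree: Fargate and Karpenter are identified
// by node labels, EKS managed node groups by the eks.amazonaws.com/nodegroup
// label, and the ASG bucket is resolved from the EC2 instance ID via a
// pre-built instance-ID -> ASG-name map (see the batching sketch further
// down). The second return value is the grouping key used in the ConfigMap.
func classifyNode(node corev1.Node, asgByInstanceID map[string]string) (NodeManager, string) {
    labels := node.Labels
    if labels["eks.amazonaws.com/compute-type"] == "fargate" {
        // Resolving the Fargate profile name is out of scope for this sketch
        // (the manual e2e output below shows "" when it cannot be resolved).
        return ManagerFargate, ""
    }
    if pool, ok := labels["karpenter.sh/nodepool"]; ok {
        return ManagerKarpenter, pool
    }
    if ng, ok := labels["eks.amazonaws.com/nodegroup"]; ok {
        return ManagerEKSManagedNodeGroup, ng
    }
    // providerID looks like aws:///eu-west-3a/i-0123456789abcdef0.
    parts := strings.Split(node.Spec.ProviderID, "/")
    instanceID := parts[len(parts)-1]
    if !strings.HasPrefix(instanceID, "i-") {
        return ManagerUnknown, node.Spec.ProviderID
    }
    if asg, ok := asgByInstanceID[instanceID]; ok && asg != "" {
        return ManagerASG, asg
    }
    return ManagerStandalone, ""
}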

Motivation

Once Karpenter is up, the user has to migrate workloads off the existing nodes onto the new Karpenter-managed ones. The right migration action depends on how each existing node is currently managed (scale the EKS managed node group to 0, delete the Fargate profile, stop the legacy cluster-autoscaler first, etc.). Today the install command finishes without leaving any record of the cluster's pre-Karpenter topology, forcing a follow-up migration tool (or the user) to re-discover it from scratch.

This PR persists the topology in a ConfigMap so the upcoming migration step has a ground truth to drive from. Tracked in CASCL-1304 under epic CASCL-920.

Additional Notes

  • Errors from the snapshot step are logged as warnings and do not fail the install — Karpenter is already up at that point and the snapshot is informational; failing the command would force a confusing rerun of the whole pipeline.
  • New IAM permission required on the operator user's identity: autoscaling:DescribeAutoScalingInstances (batched at the documented 50-ID per call limit; a batching sketch follows these notes).
  • New package cmd/kubectl-datadog/autoscaling/cluster/common/clusterinfo placed under common/ (not install/) so a future uninstall-side reader does not need to import the install package.
  • The clusterAutoscaler field is a sibling of nodeManagement so future cluster-level metadata can be added without breaking the schema (apiVersion: v1 is set explicitly to allow versioned evolution).
  • The migration step that consumes the ConfigMap is out of scope for this PR.
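
A rough sketch of the DescribeAutoScalingInstances batching mentioned above, using the aws-sdk-go-v2 autoscaling client; the function name and how the caller builds the instance-ID list are illustrative:

package clusterinfo

import (
    "context"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/autoscaling"
)

// asgNamesByInstanceID resolves which Auto Scaling group, if any, owns each
// EC2 instance, batching requests at the documented 50-ID-per-call limit of
// DescribeAutoScalingInstances. Instances that are not in any ASG are simply
// absent from the returned map.
func asgNamesByInstanceID(ctx context.Context, client *autoscaling.Client, instanceIDs []string) (map[string]string, error) {
    const maxIDsPerCall = 50
    result := make(map[string]string, len(instanceIDs))
    for start := 0; start < len(instanceIDs); start += maxIDsPerCall {
        end := start + maxIDsPerCall
        if end > len(instanceIDs) {
            end = len(instanceIDs)
        }
        resp, err := client.DescribeAutoScalingInstances(ctx, &autoscaling.DescribeAutoScalingInstancesInput{
            InstanceIds: instanceIDs[start:end],
        })
        if err != nil {
            return nil, err
        }
        for _, inst := range resp.AutoScalingInstances {
            result[aws.ToString(inst.InstanceId)] = aws.ToString(inst.AutoScalingGroupName)
        }
    }
    return result, nil
}

With 75 candidate nodes this makes two calls (50 + 25), which is what the ASG-batching unit test in the test plan below exercises.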

Minimum Agent Versions

Are there minimum versions of the Datadog Agent and/or Cluster Agent required?

  • Agent: N/A (kubectl plugin only — no Agent/Cluster Agent code path is touched)
  • Cluster Agent: N/A

Describe your test plan

  • Unit tests under cmd/kubectl-datadog/autoscaling/cluster/common/clusterinfo/ cover the empty-cluster, all-six-buckets, ASG-batching (75 nodes → 2 batches of 50+25), cluster-autoscaler detection by name / app.kubernetes.io/name label / k8s-app label / container-image substring, scaled-to-zero CA, and Persist create/update round-trips. Run with go test ./cmd/kubectl-datadog/autoscaling/cluster/common/clusterinfo/....
  • Manual e2e on a real EKS cluster:
    make kubectl-datadog
    AWS_PROFILE=... AWS_REGION=eu-west-3 \
      kubectl datadog autoscaling cluster install --cluster-name <test-cluster>
    kubectl get cm -n dd-karpenter dd-cluster-info -o yaml
    Cross-check the snapshot against kubectl get nodes -o wide and aws eks list-nodegroups.

Checklist

  • PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
  • PR has a milestone or the qa/skip-qa label
  • All commits are signed (see: signing commits)

Once Karpenter is installed on an existing EKS cluster, the user must
migrate workloads off the existing nodes. The right migration action
depends on how each node is currently managed (Fargate / Karpenter /
EKS managed node group / EC2 ASG / standalone) and whether a legacy
cluster-autoscaler is still running. Capture that topology in a
`dd-cluster-info` ConfigMap so the follow-up migration step has a
ground truth to drive from.

Errors are logged as a warning and do not fail `install`: the snapshot
is informational, and Karpenter is already up by the time it runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run `make update-golang` to align test/e2e/go.mod and go.sum with the
indirect aws-sdk-go-v2 / smithy-go patch bumps pulled in by `go work
sync`. Required by the GitLab `check-golang-version` job.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@L3n41c
Member Author

L3n41c commented Apr 27, 2026

Here is the output of the manual e2e test:

$ kubectl -n dd-karpenter get cm/dd-cluster-info -o yaml
kind: ConfigMap
apiVersion: v1
data:
  cluster-info: |
    apiVersion: v1
    clusterName: lenaic-karpenter-test
    generatedAt: 2026-04-27T14:08:30.1565563Z
    nodeManagement:
        asg:
            eksctl-lenaic-karpenter-test-nodegroup-ng-bar-NodeGroup-awxH2qriZxFV:
                - ip-10-11-233-63.eu-west-3.compute.internal
        eksManagedNodeGroup:
            ng:
                - ip-10-11-232-138.eu-west-3.compute.internal
                - ip-10-11-234-250.eu-west-3.compute.internal
            ng-foo:
                - ip-10-11-234-249.eu-west-3.compute.internal
        fargate:
            "":
                - fargate-ip-10-11-234-123.eu-west-3.compute.internal
                - fargate-ip-10-11-234-246.eu-west-3.compute.internal
        karpenter:
            dd-karpenter-3qdyk:
                - ip-10-11-234-209.eu-west-3.compute.internal
                - ip-10-11-234-23.eu-west-3.compute.internal
                - ip-10-11-234-253.eu-west-3.compute.internal
                - ip-10-11-234-37.eu-west-3.compute.internal
    clusterAutoscaler:
        present: false
metadata:
  creationTimestamp: "2026-04-27T14:08:30Z"
  labels:
    app.kubernetes.io/managed-by: kubectl-datadog
  name: dd-cluster-info
  namespace: dd-karpenter
  resourceVersion: "1321418"
  uid: 18e5aec6-2a90-4c5d-96b4-57b2100521a9
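
For reference, the Persist side that produces the data field above could look roughly like this; the info argument stands for the ClusterInfo struct quoted later in the review, the controller-runtime client and the yaml.v3 marshaller are assumptions, and the real code apparently goes through the repo's commonk8s.CreateOrUpdate helper instead of the inline get/create/update below:

package clusterinfo

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "gopkg.in/yaml.v3"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// Persist marshals the snapshot to YAML and stores it under the cluster-info
// key of the dd-cluster-info ConfigMap in the Karpenter namespace, creating
// or updating the ConfigMap as needed.
func Persist(ctx context.Context, cli client.Client, info any, namespace string) error {
    raw, err := yaml.Marshal(info)
    if err != nil {
        return err
    }
    cm := &corev1.ConfigMap{
        ObjectMeta: metav1.ObjectMeta{
            Name:      "dd-cluster-info",
            Namespace: namespace,
            Labels:    map[string]string{"app.kubernetes.io/managed-by": "kubectl-datadog"},
        },
        Data: map[string]string{"cluster-info": string(raw)},
    }
    existing := &corev1.ConfigMap{}
    err = cli.Get(ctx, client.ObjectKeyFromObject(cm), existing)
    switch {
    case err == nil:
        existing.Labels = cm.Labels
        existing.Data = cm.Data
        return cli.Update(ctx, existing)
    case apierrors.IsNotFound(err):
        return cli.Create(ctx, cm)
    default:
        return err
    }
}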

The new clusterinfo package imports
`github.com/aws/aws-sdk-go-v2/service/autoscaling` for
DescribeAutoScalingInstances batching, so the third-party license
manifest needs the matching entry. Patch matches the diff produced by
the CI's `make verify-licenses` step verbatim.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@L3n41c
Member Author

L3n41c commented Apr 27, 2026

@codex review

@datadog-prod-us1-3

datadog-prod-us1-3 Bot commented Apr 27, 2026

Code Coverage

🎯 Code Coverage (details)
Patch Coverage: 82.01%
Overall Coverage: 41.24% (+0.21%)

🔗 Commit SHA: ec85702

@codecov-commenter

codecov-commenter commented Apr 27, 2026

Codecov Report

❌ Patch coverage is 82.55034% with 26 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.12%. Comparing base (d5f00bf) to head (ec85702).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
...autoscaling/cluster/common/clusterinfo/classify.go 88.98% 7 Missing and 6 partials ⚠️
...ctl-datadog/autoscaling/cluster/install/install.go 0.00% 10 Missing ⚠️
.../autoscaling/cluster/common/clusterinfo/persist.go 90.00% 1 Missing and 1 partial ⚠️
...adog/autoscaling/cluster/common/clients/clients.go 0.00% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2945      +/-   ##
==========================================
+ Coverage   40.91%   41.12%   +0.21%     
==========================================
  Files         324      326       +2     
  Lines       28743    28894     +151     
==========================================
+ Hits        11760    11883     +123     
- Misses      16129    16150      +21     
- Partials      854      861       +7     
Flag Coverage Δ
unittests 41.12% <82.55%> (+0.21%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...adog/autoscaling/cluster/common/clients/clients.go 12.71% <0.00%> (-0.11%) ⬇️
.../autoscaling/cluster/common/clusterinfo/persist.go 90.00% <90.00%> (ø)
...ctl-datadog/autoscaling/cluster/install/install.go 15.31% <0.00%> (-0.46%) ⬇️
...autoscaling/cluster/common/clusterinfo/classify.go 88.98% <88.98%> (ø)

... and 1 file with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d5f00bf...ec85702. Read the comment docs.



@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6a5a25b67c


return err
}

if err = recordClusterInfo(ctx, cli, clusterName, karpenterNamespace); err != nil {

P2: Snapshot before creating installer-managed nodes

In the default --install-mode=fargate path, installHelmChart has already waited for the Karpenter controller before this call, and the Fargate stack selects the Karpenter namespace, so EKS can already have created the new dd-karpenter-<cluster> Fargate node/profile by the time the snapshot is taken. That makes the ConfigMap include installer-created Karpenter infrastructure as if it were pre-existing topology, and a migration reader using this snapshot to retire legacy Fargate profiles could target the profile that keeps Karpenter running; take the snapshot before installing/waiting for Karpenter or filter the installer-created profile.


Member Author

Good catch — fixed in 234ffa6.

Classify now runs before createCloudFormationStacks, so the Fargate profile and the controller node it eventually backs are not yet observable from the cluster. Persist stays at the end, where the Karpenter namespace exists. Both calls remain best-effort (warnings on failure).
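
The best-effort wiring in install could be as small as the following; recordClusterInfo is the call shown in the diff context above, while the argument types and the stderr warning are stand-ins for whatever the command actually uses:

import (
    "context"
    "fmt"
    "os"

    "sigs.k8s.io/controller-runtime/pkg/client"
)

func snapshotClusterInfo(ctx context.Context, cli client.Client, clusterName, karpenterNamespace string) {
    if err := recordClusterInfo(ctx, cli, clusterName, karpenterNamespace); err != nil {
        // Best effort: the snapshot is informational and Karpenter is already
        // up, so a failure is reported but never fails the install.
        fmt.Fprintf(os.Stderr, "warning: failed to record cluster info: %v\n", err)
    }
}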

@L3n41c L3n41c force-pushed the lenaic/CASCL-1304-list-node-manager branch from 234ffa6 to 6a5a25b on April 27, 2026 at 16:55
@L3n41c L3n41c marked this pull request as ready for review April 27, 2026 16:58
@L3n41c L3n41c requested a review from a team April 27, 2026 16:58
@L3n41c L3n41c requested a review from a team as a code owner April 27, 2026 16:58

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6a5a25b67c


},
}

return commonk8s.CreateOrUpdate(ctx, cli, cm)

P2: Preserve the first cluster-info snapshot

When install is run again after a successful Fargate install, this CreateOrUpdate overwrites the existing dd-cluster-info ConfigMap. Since the Karpenter namespace/Fargate profile and controller node already exist before a rerun starts, the original pre-install topology can be replaced with installer-created nodes, so the follow-up migration may treat the dd-karpenter-* profile as legacy infrastructure. If the ConfigMap already exists, keep the original snapshot or filter installer-owned entries instead of updating it.


Member Author

Contesting this one. The "preserve the first snapshot" approach has the same fundamental problem in the more common scenario where the user re-runs install after a partial failure (IAM hiccup, kubeconfig issue, transient AWS error, …): the "first" snapshot would freeze whatever state existed during that partially-successful first run, which may itself include installer-managed nodes — and the user would then be stuck with a wrong snapshot they cannot refresh.

Both alternatives — snapshot-before-install (#234ffa64, since reverted) and preserve-first-snapshot — buy correctness on a perfectly clean first run at the cost of locking the snapshot in cases where iteration is needed. We are accepting the late-snapshot trade-off knowingly: the migration consumer will learn to skip Karpenter-managed nodes in the dd-karpenter-* Fargate profile and the karpenter bucket. Tracking that filter in the migration tool is a one-liner and survives reruns.

Not actioning.

// ClusterAutoscaler captures whether a legacy cluster-autoscaler Deployment
// is running and where, so the migration can warn the user to stop it before
// scaling EKS managed node groups (per the Karpenter migration guide).
type ClusterAutoscaler struct {
Collaborator

nits: maybe the version too.

Member Author

Done in ec85702. The version comes from the image tag of the matching container (the source of truth), with app.kubernetes.io/version on the Deployment / pod template as a fallback for digest-only image references. The field is omitted from the YAML when neither signal is available.
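
A sketch of that preference order, assuming the matching container spec and the Deployment/pod-template labels are already in hand; the function name is illustrative and the tag parsing is simplified compared to a full image-reference parser:

package clusterinfo

import (
    "strings"

    corev1 "k8s.io/api/core/v1"
)

// clusterAutoscalerVersion prefers the image tag of the matching container
// and falls back to the app.kubernetes.io/version label; it returns "" when
// neither is available, in which case the field is omitted from the YAML.
func clusterAutoscalerVersion(container corev1.Container, labels map[string]string) string {
    ref := container.Image
    // Strip any digest suffix (image:tag@sha256:... or image@sha256:...).
    if at := strings.Index(ref, "@"); at != -1 {
        ref = ref[:at]
    }
    // A trailing ":<tag>" is a tag as long as it is not actually the port of
    // a registry host (which would still be followed by a "/").
    if idx := strings.LastIndex(ref, ":"); idx != -1 && !strings.Contains(ref[idx:], "/") {
        return ref[idx+1:] // e.g. .../cluster-autoscaler:v1.30.0 -> v1.30.0
    }
    // Digest-only or untagged reference: fall back to the Helm chart label.
    return labels["app.kubernetes.io/version"]
}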

ClusterName string `yaml:"clusterName"`
GeneratedAt time.Time `yaml:"generatedAt"`
NodeManagement map[NodeManager]map[string][]string `yaml:"nodeManagement"`
ClusterAutoscaler ClusterAutoscaler `yaml:"clusterAutoscaler"`
Collaborator

should we also collect if karpenter or karpenter-auto-mode was already present?

Member Author

@L3n41c L3n41c Apr 28, 2026

We already collect if some nodes are managed by Karpenter.
So, we may miss the fact that Karpenter is up and running if it hasn’t had to create any node yet.
May I suggest adding the presence of a detected Karpenter in a follow-up PR, because I put the logic to detect “foreign” Karpenter (not installed by kubectl datadog) in another dedicated PR: #2949?

Per review feedback: the migration consumer needs to know which CA
version is running so it can branch on known-bad versions or surface
deprecation guidance. Capture the version, preferring the image tag
of the matching container (the source of truth) and falling back to
the `app.kubernetes.io/version` label set by most Helm charts. The
field is omitted from the YAML when neither signal is available
(e.g. an image referenced by digest only and no version label).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@L3n41c
Member Author

L3n41c commented Apr 28, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ec857027f1


Comment thread cmd/kubectl-datadog/autoscaling/cluster/install/install.go
@L3n41c
Member Author

L3n41c commented Apr 28, 2026

/merge

@gh-worker-devflow-routing-ef8351

gh-worker-devflow-routing-ef8351 Bot commented Apr 28, 2026

View all feedbacks in Devflow UI.

2026-04-28 13:39:41 UTC ℹ️ Start processing command /merge


2026-04-28 13:39:46 UTC ℹ️ MergeQueue: pull request added to the queue

The expected merge time in main is approximately 2h (p90).


2026-04-28 15:40:26 UTC MergeQueue: The build pipeline has timed out

The merge request has been interrupted because the build 0 took longer than expected. The current limit for the base branch 'main' is 120 minutes.

